LLMSys Papers Appearing at OSDI'24 and SOSP'23
SOSP'23
SOSP'23 featured nine papers on machine learning systems. Three of them focused on traditional ML system optimization and one on scheduling; on the LLM side, one covered compiler optimization for matrix computation, two covered inference, and two covered fault tolerance. As the scale of training and the methods in use keep expanding and evolving, fault tolerance became a new hotspot in the second half of 2023.
Compilation
PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
Inference
Paella: Low-latency Model Serving with Software-defined GPU Scheduling by Kelvin K.W. Ng (University of Pennsylvania), Henri Maxime Demoulin (DBOS, Inc.) and Vincent Liu (University of Pennsylvania)
Paella pairs a compiler with a scheduler-aware client library so that CUDA kernel dispatch is controlled in software rather than by the GPU's hardware scheduler. Compared to the existing SOTA, it increases GPU throughput by 11x and reduces latency by 58x.
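As a rough illustration of the software-defined scheduling idea (the class names and the deadline-based policy below are ours, not Paella's actual API), kernel launches can be routed through a user-space scheduler that decides dispatch order instead of queuing directly on the GPU:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class KernelLaunch:
    # Lower deadline -> dispatched first; this policy is purely illustrative.
    deadline: float
    name: str = field(compare=False)

class SoftwareScheduler:
    """User-space scheduler: kernels wait in a software queue, so short or
    urgent launches are not stuck behind long-running ones on the GPU."""
    def __init__(self):
        self.queue = []

    def submit(self, launch: KernelLaunch):
        heapq.heappush(self.queue, launch)

    def dispatch_next(self):
        if self.queue:
            launch = heapq.heappop(self.queue)
            print(f"dispatching {launch.name} (deadline={launch.deadline})")
            return launch
        return None

sched = SoftwareScheduler()
sched.submit(KernelLaunch(deadline=5.0, name="resnet_conv1"))
sched.submit(KernelLaunch(deadline=1.0, name="bert_attention"))
sched.dispatch_next()  # bert_attention goes first despite arriving later
```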
vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention by Woosuk Kwon (UC Berkeley), Zhuohan Li (UC Berkeley), Siyuan Zhuang (UC Berkeley), Ying Sheng (Stanford University), Lianmin Zheng (UC Berkeley), Cody Hao Yu (Independent Researcher), Joseph Gonzalez (UC Berkeley), Hao Zhang (UC San Diego) and Ion Stoica (UC Berkeley)
By managing the KV cache in paged, non-contiguous blocks (PagedAttention), the vLLM framework reduces memory waste and accelerates large language model inference.
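As a rough sketch of the paging idea (the block size and data structures below are illustrative, not vLLM's actual implementation), each sequence's KV cache lives in fixed-size blocks that need not be contiguous, so blocks are allocated on demand and returned to a shared pool when a request finishes:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class PagedKVCache:
    """Block-table bookkeeping: each sequence maps to a list of physical block ids."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> [physical block ids]
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id: int):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:              # current block full (or first token)
            table.append(self.free_blocks.pop())  # grab any free physical block
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int):
        # Return all of the sequence's blocks to the pool when it finishes.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):            # cache 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])   # three physical block ids; need not be contiguous or ordered
```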
Fault Tolerance
GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints by Zhuang Wang (Rice University), Zhen Jia (Amazon Web Services, Inc.), Shuai Zheng (Amazon Web Services), Zhen Zhang (Amazon Web Services), Xinwei Fu (Amazon Web Services), T. S. Eugene Ng (Rice University) and Yida Wang (Amazon)
By designing a hierarchical memory system for checkpoints (in-memory plus persistent storage), GEMINI accelerates failure recovery by 13x compared to the SOTA.
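A minimal sketch of the hierarchical-checkpoint idea (the class and tier names below are hypothetical, not GEMINI's implementation): checkpoints are kept in local CPU memory and replicated to a peer machine's memory, with persistent storage only as the slow fallback, and recovery reads from the fastest tier that still holds a copy:

```python
import pickle

class HierarchicalCheckpointer:
    """Illustrative tiered checkpoint store: in-memory copies on this host and
    on a 'peer' host (simulated), plus a persistent file as the slow fallback."""
    def __init__(self, path="ckpt.bin"):
        self.local_mem = None   # checkpoint held in this host's CPU memory
        self.peer_mem = None    # copy held in a peer host's memory (simulated)
        self.path = path

    def save(self, state: dict):
        blob = pickle.dumps(state)
        self.local_mem = blob          # fast in-memory copy
        self.peer_mem = blob           # replicate to a peer to survive machine failures
        with open(self.path, "wb") as f:
            f.write(blob)              # persistent fallback (slowest tier)

    def restore(self) -> dict:
        # Prefer the fastest surviving tier.
        for blob in (self.local_mem, self.peer_mem):
            if blob is not None:
                return pickle.loads(blob)
        with open(self.path, "rb") as f:
            return pickle.loads(f.read())

ckpt = HierarchicalCheckpointer()
ckpt.save({"step": 1000, "weights": [0.1, 0.2]})
ckpt.local_mem = None                  # simulate losing this host's copy
print(ckpt.restore()["step"])          # recovered from the peer's in-memory copy
```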
Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates by Insu Jang (University of Michigan), Zhenning Yang (University of Michigan), Zhen Zhang (Amazon Web Services), Xin Jin (Peking University) and Mosharaf Chowdhury (University of Michigan)
Oobleck precomputes pipeline templates and, after a failure, reconfigures the surviving nodes into new pipelines from those templates, which recovers faster than restarting the job from a checkpoint.
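A minimal sketch of the template idea (the data structures below are illustrative, not Oobleck's): pipeline templates for different node counts are precomputed before training, and after a failure the job instantiates the largest template that fits the surviving nodes:

```python
# Precomputed pipeline templates: node count -> stage widths (illustrative).
PIPELINE_TEMPLATES = {
    8: [2, 2, 2, 2],   # 4 stages, 2 nodes each
    6: [2, 2, 2],      # 3 stages, 2 nodes each
    4: [1, 1, 1, 1],   # 4 stages, 1 node each
}

def reconfigure(surviving_nodes: list[str]) -> list[list[str]]:
    """Pick the largest precomputed template that fits the surviving nodes and
    assign nodes to its stages, so training resumes without a full restart."""
    n = max(k for k in PIPELINE_TEMPLATES if k <= len(surviving_nodes))
    stages, it = [], iter(surviving_nodes[:n])
    for width in PIPELINE_TEMPLATES[n]:
        stages.append([next(it) for _ in range(width)])
    return stages

nodes = [f"node{i}" for i in range(8)]
nodes.remove("node3")                  # simulate one node failing
print(reconfigure(nodes))              # falls back to the 6-node template: 3 stages of 2 nodes
```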
OSDI'24
Over half of the papers are on LLM serving.
Large Language Model (LLM) Serving
Llumnix: Dynamic Scheduling for Large Language Model Serving
- From Alibaba
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
- From PKU
- Related work: S-LoRA [MLSys'24], Punica [MLSys'24], SqueezeLLM [ICML'24], GPTQ
DistLLM: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- From PKU
- Related work: Splitwise [ISCA'24]
Fairness in Serving Large Language Models
- From UC Berkeley
ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
- From UoE
- Related work: SpotServe [ASPLOS'24], Parcae [NSDI'24]
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- From GaTech
- Related work: DeepSpeed SplitFuse [Microsoft]
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
- From SJTU & MSRA
Deep Learning Training and Inference Optimization
Bitter: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
- From MSRA
Cuber: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
- From USTC & MSRA
- Related work: Alpa [OSDI'22], Unity [OSDI'22], Galvatron [VLDB'23]
Automatic and Efficient Customization of Neural Networks for ML Applications
- From UChicago
MonoInfer: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
- From USYD
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
- From HNU & Huawei
USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
- From MSR
Distributed and Online Learning
Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents
- From Stanford
Dynamic Scheduling of ML Training across Geo-Distributed Datacenters: Principles and Experiences
- From OSU
Main references
- https://www.zhihu.com/question/623328589/answer/3284535941 (孙挺Sunt)
- https://www.zhihu.com/question/649626302/answer/3440577248 (Hsword)